{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tuning Your Hyper-Parameters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use SageMaker tuning to automate the searching process effectively. Specifically, we specify a range, or a list of possible values in the case of categorical hyperparameters, for each of the hyperparameter that we plan to tune. SageMaker hyperparameter tuning will automatically launch multiple training jobs with different hyperparameter settings, evaluate results of those training jobs based on a predefined \"objective metric\", and select the hyperparameter settings for future attempts based on previous results. For each hyperparameter tuning job, we will give it a budget (max number of training jobs) and it will complete once that many training jobs have been executed.\n",
    "\n",
    "In this example, we are using SageMaker Python SDK to set up and manage the hyperparameter tuning job. We first configure the training jobs the hyperparameter tuning job will launch by initiating an estimator, which includes:\n",
    "* The container image for the algorithm (XGBoost)\n",
    "* Configuration for the output of the training jobs\n",
    "* The values of static algorithm hyperparameters, those that are not specified will be given default values\n",
    "* The type and number of instances to use for the training jobs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q boto3\n",
    "!pip install -q xgboost==0.90"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "import sagemaker\n",
    "import pandas as pd\n",
    "\n",
    "sess   = sagemaker.Session()\n",
    "bucket = sess.default_bucket()\n",
    "role = sagemaker.get_execution_role()\n",
    "region = boto3.Session().region_name\n",
    "\n",
    "sm = boto3.Session().client(service_name='sagemaker', region_name=region)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%store -r spark_processing_job_s3_output_prefix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Previous Spark Processing Job Name: {}'.format(spark_processing_job_s3_output_prefix))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Specify the S3 Location of the Features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prefix_train = '{}/output/tfidf-labeled-split-balanced-noheader-train'.format(spark_processing_job_s3_output_prefix)\n",
    "prefix_validation = '{}/output/tfidf-labeled-split-balanced-noheader-validation'.format(spark_processing_job_s3_output_prefix)\n",
    "prefix_test = '{}/output/tfidf-labeled-split-balanced-noheader-test'.format(spark_processing_job_s3_output_prefix)\n",
    "\n",
    "balanced_tfidf_without_header_train_path = './{}'.format(prefix_train)\n",
    "balanced_tfidf_without_header_validation_path = './{}'.format(prefix_validation)\n",
    "balanced_tfidf_without_header_test_path = './{}'.format(prefix_test)\n",
    "\n",
    "balanced_tfidf_without_header_train_s3_uri = 's3://{}/{}'.format(bucket, prefix_train)\n",
    "balanced_tfidf_without_header_validation_s3_uri = 's3://{}/{}'.format(bucket, prefix_validation)\n",
    "balanced_tfidf_without_header_test_s3_uri = 's3://{}/{}'.format(bucket, prefix_test)\n",
    "\n",
    "s3_input_train_data = sagemaker.s3_input(s3_data=balanced_tfidf_without_header_train_s3_uri, content_type='text/csv')\n",
    "s3_input_validation_data = sagemaker.s3_input(s3_data=balanced_tfidf_without_header_validation_s3_uri, content_type='text/csv')\n",
    "s3_input_test_data = sagemaker.s3_input(s3_data=balanced_tfidf_without_header_test_s3_uri, content_type='text/csv')\n",
    "\n",
    "print(s3_input_train_data.config)\n",
    "print(s3_input_validation_data.config)\n",
    "print(s3_input_test_data.config)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.amazon.amazon_estimator import get_image_uri \n",
    "import json\n",
    "\n",
    "builtin_container_uri = get_image_uri(region_name=region,                                \n",
    "                                      repo_name='xgboost', \n",
    "                                      repo_version='0.90-2')\n",
    "print(builtin_container_uri)\n",
    "\n",
    "model_output_path = 's3://{}/models/tuner/'.format(bucket)\n",
    "print(model_output_path)\n",
    "\n",
    "xgb_estimator = sagemaker.estimator.Estimator(image_name=builtin_container_uri, \n",
    "                                              role=role, \n",
    "                                              train_instance_count=2,\n",
    "                                              train_instance_type='ml.c5.4xlarge',\n",
    "                                              output_path=model_output_path, \n",
    "                                              sagemaker_session=sess,\n",
    "                                              # enable_cloudwatch_metrics=True\n",
    "                                             )\n",
    "\n",
    "xgb_estimator.set_hyperparameters(objective='binary:logistic',\n",
    "                                  num_round=1,\n",
    "                                  max_depth=5)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Setup Hyper-Parameter Ranges to Explore"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.tuner import IntegerParameter\n",
    "from sagemaker.tuner import ContinuousParameter\n",
    "from sagemaker.tuner import HyperparameterTuner\n",
    "\n",
    "hyperparameter_ranges = {\n",
    "    'alpha': ContinuousParameter(0, 1000, scaling_type=\"Auto\"),\n",
    "    'colsample_bylevel': ContinuousParameter(0.1, 1,scaling_type=\"Logarithmic\"),\n",
    "    'colsample_bytree': ContinuousParameter(0.5, 1, scaling_type='Logarithmic'),\n",
    "    'eta': ContinuousParameter(0.1, 0.5, scaling_type='Logarithmic'),\n",
    "    'gamma':ContinuousParameter(0, 5, scaling_type='Auto'),\n",
    "    'lambda': ContinuousParameter(0,100,scaling_type='Auto'),\n",
    "    'max_delta_step': IntegerParameter(0,10,scaling_type='Auto'),\n",
    "    'max_depth': IntegerParameter(0,10,scaling_type='Auto'),\n",
    "    'min_child_weight': ContinuousParameter(0,10,scaling_type='Auto'),\n",
    "    'num_round': IntegerParameter(1,4000,scaling_type='Auto'),\n",
    "    'subsample': ContinuousParameter(0.5,1,scaling_type='Logarithmic')}\n",
    "\n",
    "objective_metric_name = 'validation:auc'\n",
    "\n",
    "tuner = HyperparameterTuner(\n",
    "    xgb_estimator,\n",
    "    objective_metric_name,\n",
    "    hyperparameter_ranges,\n",
    "    max_jobs=1,\n",
    "    max_parallel_jobs=1,\n",
    "    strategy='Bayesian'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Start the Hyper-Parameter Tuning Experiment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tuner.fit({'train': s3_input_train_data, \n",
    "           'validation': s3_input_validation_data}, \n",
    "           include_cls_metadata=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pprint import pprint\n",
    "\n",
    "job_description = sm.describe_hyper_parameter_tuning_job(\n",
    "    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name\n",
    ")\n",
    "\n",
    "status = job_description['HyperParameterTuningJobStatus']\n",
    "\n",
    "print('\\n')\n",
    "print(status)\n",
    "print('\\n')\n",
    "pprint(job_description)\n",
    "\n",
    "if status != 'Completed':\n",
    "    job_count = tuning_job_result['TrainingJobStatusCounters']['Completed']\n",
    "    print('Not yet complete, but {} jobs have completed.')\n",
    "    \n",
    "    if job_description.get('BestTrainingJob', None):\n",
    "        print(\"Best candidate:\")\n",
    "        pprint(tuning_job_result['BestTrainingJob']['TrainingJobName'])\n",
    "        pprint(tuning_job_result['BestTrainingJob']['FinalHyperParameterTuningJobObjectiveMetric'])\n",
    "    else:\n",
    "        print(\"No training jobs have reported results yet.\")    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Compare Candidates with SageMaker Experiments\n",
    "\n",
    "Model tuning automatically creates a new experiment and tracks the results of each job within the tuning experiment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.analytics import HyperparameterTuningJobAnalytics\n",
    "\n",
    "exp = HyperparameterTuningJobAnalytics(\n",
    "    sagemaker_session=sess, \n",
    "    hyperparameter_tuning_job_name=tuner.latest_tuning_job.name\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "df = exp.dataframe()\n",
    "\n",
    "df.sort_values('FinalObjectiveValue', ascending=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
